Programmable Counters for Counting Floating-Point Operations in SIMD Processors

ABSTRACT

A processor includes one or more execution units to execute instructions, each having one or more elements in different element sizes using one or more registers in different register sizes. The processor further includes a counter configured to count a number of instructions performing predetermined types of operations executed by the one or more execution units. The processor further includes one or more registers to allow an external component to configure the counter to count a number of instructions associated with a combination of a register size and an element size (register/element size) and to retrieve a counter value produced by the counter.

TECHNICAL FIELD

Embodiments of the present invention relate generally to performance monitoring of processors. More particularly, embodiments of the invention relate to programmable counters for counting floating-point operations of a processor.

BACKGROUND ART

The high-performance computing (HPC) community, both hardware vendors and software developers, relies on an accurate count of floating-point operations executed. These measurements are used in a variety of ways, including comparing a system's actual floating-point operation (FLOP) performance with its advertised peak FLOP performance, and analyzing applications for the percentage of scalar FLOPs compared with packed FLOPs. Static analysis of the application to obtain this information can be difficult because, during execution, code paths through the application may vary based on dynamic conditions, such as array alignment in memory, loop iteration counts dependent upon input problem size, and loop iteration counts dependent on algorithmic convergence requirements. Scalar operations are often used when data packing is not possible due to memory communication between the loop iterations, and are also used to “peel” iterations of a loop to achieve a particular memory alignment for packed memory operations.

FLOP has a precise definition within the HPC community: it refers to single- or double-precision arithmetic operations (i.e., add, subtract, multiply, and divide), and does not include memory or logical operations. Some compound instructions, such as Fused Multiply Add (FMA) instructions, count as multiple FLOPs; an FMA counts as two, one for the multiply and one for the add. Each element in a packed single-instruction-multiple-data (SIMD) arithmetic operation counts as a FLOP (two in the case of an FMA). For example, a 256-bit packed single-precision (32-bit) floating-point add operates on 8 elements, and thus counts 8 FLOPs. Scalar operations use the full SIMD register data path, but only operate on a single element, and therefore only count 1 FLOP (2 in the case of FMA). There has been a lack of an efficient mechanism that can accurately count the FLOPs in such an operating environment.
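
The counting rule just described can be illustrated with a minimal sketch. The function name and parameters below are hypothetical and chosen only for illustration; the sketch simply restates the definition (one FLOP per element, doubled for an FMA, one element for a scalar operation).

```python
def flops_per_instruction(register_bits, element_bits, packed=True, fma=False):
    """Return the FLOPs credited to one retired arithmetic instruction.

    Hypothetical helper: restates the HPC FLOP definition, nothing more.
    """
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    # A scalar instruction uses the full SIMD data path but operates on one element.
    elements = register_bits // element_bits if packed else 1
    # An FMA counts twice per element: one multiply and one add.
    return elements * (2 if fma else 1)


print(flops_per_instruction(256, 32))                # 8
print(flops_per_instruction(128, 32, packed=False))  # 1
print(flops_per_instruction(256, 32, fma=True))      # 16
```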

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.

FIG. 1 is a block diagram illustrating a system for counting FLOPs according to one embodiment of the invention.

FIG. 2 is a block diagram illustrating an example of a processor according to one embodiment of the invention.

FIG. 3 is a block diagram illustrating mapping of subevents used to program GPCs according to one embodiment of the invention.

FIG. 4 is a block diagram illustrating mapping of subevents used to program GPCs according to another embodiment of the invention.

FIGS. 5A and 5B are flow diagrams illustrating a method for counting arithmetic operations according to some embodiments of the invention.

FIG. 6 is a flow diagram illustrating a method for determining arithmetic operations performed by certain instructions according to another embodiment of the invention.

FIG. 7 is a block diagram illustrating an example of a data processing system according to one embodiment.

FIG. 8 is a block diagram illustrating an example of a data processing system according to another embodiment.

DESCRIPTION OF THE EMBODIMENTS

Various embodiments and aspects of the inventions will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of various embodiments of the present invention. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of embodiments of the present inventions.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification do not necessarily all refer to the same embodiment.

According to some embodiments, one or more counters, such as general-purpose counters (GPCs) or specific-purpose or fixed counters, of a processor or processor core are programmed to count FLOPs performed by specific instructions in various combinations of instruction types and/or instruction sizes. A set of one or more registers is configured as a counter access interface of the counters to allow a software component to specify which of the counters are to count the number of instructions of a particular type executed or retired, representing a particular type of arithmetic operations such as FLOPs performed by the instructions, and for which combinations of instruction types and/or instruction sizes, which may be represented by register sizes and/or element sizes (e.g., 32-bit/64-bit/128-bit/256-bit/512-bit/1024-bit, scalar/packed, single/double precision). The counters are configured to count a number of instances of different combinations of instructions with different instruction types/sizes executed or retired. Based on different instruction types/sizes, the software component can apply different factors such as multipliers to compute the actual number of arithmetic operations performed by the instructions counted. Further, combinations of register sizes and element sizes that result in the same arithmetic operations factor or multiplier could be counted at the same time in the same counter. In one embodiment, instead of counting the arithmetic operations of instructions prior to or at the time of execution (which may or may not actually be executed and retired), as a conventional method does, the counters are configured to count instances of the instructions to represent the arithmetic operations of the instructions that have actually been executed and retired from the execution units. As a result, the calculated arithmetic operations are far more accurate than those of the conventional methods.

Throughout this application, GPCs are utilized as examples of counters of a processor or processor core; however, other types of counters such as specific-purpose or fixed counters (e.g., specifically configured or hardwired to count certain events) can also be applied herein. In addition, FLOPs are utilized as examples of arithmetic operations to be calculated; other arithmetic operations such as shifts, etc., can also be applied herein.

FIG. 1 is a block diagram illustrating a system for counting FLOPs according to one embodiment of the invention. Referring to FIG. 1, system 100 includes one or more applications 101-102 (e.g., performance analytic applications) to access processor 104 via operating system 103. Specifically, according to one embodiment, processor 104 includes a set of counters 108-110 to count the number of particular types of instructions retired, representing certain types of arithmetic operations such as FLOPs performed by instructions executed by one or more execution units 111. Different counters can be programmed by a software component such as applications 101-102 to count FLOPs performed by instructions of a particular type and size, referred to herein as a combination of instruction type/size. According to one embodiment, processor 104 includes programmable counter interface 107 to allow a software component to program counters 108-110 and to retrieve the count values produced by counters 108-110.

In one embodiment, operating system 103 includes an application programming interface (API) 105 to allow applications 101-102 to access certain functionalities of operating system 103 and one or more device drivers 106 configured to access certain hardware and/or firmware of system 100. In this embodiment, device driver 106 runs at a privileged level of operating system 103 (e.g., kernel level, ring zero level, or supervisor level) and is specifically configured to access GPCs 108-110. That is, applications 101-102 do not have privileges to directly access GPCs 108-110; rather, applications 101-102 make one or more specific function calls to API 105, which in turn accesses device driver 106. Device driver 106 then accesses programmable counter interface 107 to program GPCs 108-110 and/or to retrieve count values from GPCs 108-110.

According to one embodiment, programmable counter interface 107 may include a set of one or more registers that can be accessed by device driver 106. For example, the set of one or more registers may be a set of one or more model specific registers (MSRs) through which device driver 106 can specify which of counters 108-110 are to count FLOPs performed by instructions of a particular type or types (e.g., opcodes representing instructions such as ADD, SUB, MUL, DIV, MIN, MAX, RECIP, SQRT, FMA, etc.) in a particular size or width (e.g., 32-bit, 64-bit, 128-bit, 256-bit, 512-bit, or 1024-bit, scalar or packed). In one embodiment, a GPC may be selected and programmed based on a particular register size and an element size (e.g., single or double precision) or a number of elements packed (e.g., scalar or packed instruction) within a particular type of instructions. In one embodiment, instead of computing the FLOPs of instructions prior to or at the time of execution (which may or may not actually be executed), as a conventional method does, the counters 108-110 are configured to count the number of instances of instructions performing the FLOPs that have actually been executed and retired from the execution units 111. As a result, the counted FLOPs are far more accurate than those of the conventional method.

FIG. 2 is a block diagram illustrating an example of a processor according to one embodiment of the invention. Referring to FIG. 2, processor 104 may represent any kind of instruction processing apparatus. For example, processor 104 may be a general-purpose processor. Processor 104 may be any of various complex instruction set computing (CISC) processors, various reduced instruction set computing (RISC) processors, various very long instruction word (VLIW) processors, various hybrids thereof, or other types of processors entirely. In one embodiment, processor 104 includes, but is not limited to, instruction fetch unit 201, instruction decoder 202, one or more execution units 203, retirement unit 204, and GPC counter unit 205 having programmable GPCs 108-110, which are accessible by a software component via MSRs 206.

Instruction fetch unit 201 is configured to fetch or prefetch instructions from an instruction cache or data from memory. Instruction decoder 202 is to receive and decode instructions from instruction fetch unit 201. Instruction decoder 202 may generate and output one or more micro-operations, micro-code, entry points, microinstructions, other instructions, or other control signals, which reflect, or are derived from, the instructions. Instruction decoder 202 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, microcode read only memories (ROMs), look-up tables, hardware implementations, programmable logic arrays (PLAs), and the like.

Execution units 203 may include an arithmetic logic unit, or another type of logic unit capable of performing operations based on instructions (which can be micro-operations or μOps). As a result of instruction decoder 202 decoding the instructions, execution unit 203 may receive one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which reflect, or are derived from, the instructions. Execution unit 203 may be operable as a result of instructions indicating one or more source operands (SRC) and to store a result in one or more destination operands (DEST) of a register set indicated by the instructions. Execution unit 203 may include circuitry or other execution logic (e.g., software combined with hardware and/or firmware) operable to execute instructions or other control signals derived from the instructions and perform an operation accordingly. Execution unit 203 may represent any kind of execution unit such as logic units, arithmetic logic units (ALUs), arithmetic units, integer units, etc.

Some or all of the source and destination operands may be stored in registers of a register set or memory. The register set may be part of a register file, along with potentially other registers, such as status registers, flag registers, etc. A register may be a storage location or device that may be used to store data. The register set may often be physically located on die with the execution unit(s). The registers may be visible from the outside of the processor or from a programmer's perspective. For example, instructions may specify operands stored in the registers. Various different types of registers are suitable, as long as they are capable of storing and providing data as described herein. The registers may or may not be renamed. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. Alternatively, one or more of the source and destination operands may be stored in a storage location other than a register, such as, for example, a location in system memory.

Referring back to FIG. 2, according to one embodiment, GPCs 108-110 of GPC unit 205 are programmed to count instances of specific instructions performing FLOPs in various combinations of instruction types and/or instruction sizes. A set of one or more registers, MSRs 206, is configured as a counter access interface of the counters 108-110 to allow a software component to specify which of the counters 108-110 are to count instances of specific instructions performing certain types of operations such as FLOPs, and for which combinations of instruction types and/or instruction sizes, which may be represented by register sizes and element sizes (e.g., 32-bit/64-bit/128-bit/256-bit/512-bit/1024-bit, scalar/packed, single/double precision). The counters 108-110 are configured to count instances of specific instructions with the associated instruction type/size that perform the FLOPs. Based on different instruction types/sizes, the software component can apply different factors such as multipliers to compute the actual number of FLOPs performed by the instructions. In one embodiment, counters 108-110 are configured to count instances of specific instructions performing the FLOPs that have actually been executed by execution unit 203 and retired by retirement unit 204.

According to one embodiment, when an instruction has been executed by execution unit 203, retirement unit 204 is to identify and select one of counters 108-110 based on the instruction type and the elements of the instruction. Retirement unit 204 is then to send a signal to the selected counter to cause the selected counter to increment by an incremental value. In addition, according to one embodiment, if the instruction is a special type of instruction (e.g., a combo instruction) that performs multiple FLOPs, which may be indicated by instruction type indicator 207, retirement unit 204 is to signal the selected GPC to increment by multiple incremental values equivalent to the number of individual instructions per element represented therein. Instruction type indicator 207 may be detected by retirement unit 204 or, alternatively, by instruction decoder 202 during instruction decoding. For example, a fused multiply add (FMA) instruction causes a processor to perform a multiplication and an addition operation, which count as two FLOPs. In such a situation, retirement unit 204 is to cause the corresponding counter to count two instances of instructions.
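
The retirement-time behavior described above can be modeled in software as a rough sketch. The class and method names are hypothetical, and the model increments every counter whose programmed subevent mask matches the retired instruction, which is a simplifying assumption rather than a description of the hardware selection logic.

```python
class RetirementCounters:
    """Toy software model of counters signaled at instruction retirement."""

    def __init__(self, subevent_masks):
        # One programmed subevent bit mask per counter (hypothetical layout).
        self.masks = list(subevent_masks)
        self.values = [0] * len(self.masks)

    def retire(self, subevent_bit, instances=1):
        """Signal a retired instruction belonging to the given subevent bit.

        `instances` is 1 for a plain arithmetic instruction and 2 for a
        compound instruction such as an FMA (multiply plus add).
        """
        for index, mask in enumerate(self.masks):
            if mask & (1 << subevent_bit):
                self.values[index] += instances


# Counter 0 counts the two scalar subevents (bits 0 and 1); counter 1 counts bit 6.
unit = RetirementCounters([0b11, 1 << 6])
unit.retire(6)               # packed add: one instance
unit.retire(6, instances=2)  # packed FMA: two instances
unit.retire(0)               # scalar add: one instance
print(unit.values)           # [1, 3]
```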

In one embodiment, any of counters 108-110 can be programmed by a software component via MSRs 206, by specifying a main event 208 and subevent 209. Main event 208 is one of the predefined events to access counters 108-110 to count the types of instructions or opcodes such as ADD, SUB, MUL, DIV, MIN, MAX, RECIP, SQRT, FMA, etc. Subevent 209 is to specify the elements associated with the instructions, such as combinations of register sizes and element sizes. In one embodiment, multiple subevents can be counted by a single counter. The software component can also retrieve the count values of counters 108-110 via MSRs 206, for example, either operating in an interrupt mode or operating in a polling mode.

FIG. 3 is a block diagram illustrating mapping of subevents used to program GPCs according to one embodiment of the invention. Referring to FIG. 3, main event 208 is to program the counters to count the number of instances of instructions performing FLOPs. A software component can write main event 208 to a predetermined MSR register by specifying FP_ARITH_INST_RETIRED, which instructs the GPCs to count FLOPs for a predefined set of instructions such as ADD, SUB, MUL, DIV, MIN, MAX, RECIP, SQRT, and FMA instructions. Subevent 209 includes a set of subevents, each corresponding to a type 301 of instruction represented by a combination of register sizes (e.g., 32-bit, 64-bit, 128-bit, 256-bit, 512-bit, and 1024-bit) and element sizes (e.g., scalar/packed, single/double precision). A GPC may be programmed to count one or more of these types 301 of instructions. Instructions of different combinations of register sizes and element sizes may perform different numbers of FLOPs. A software component that retrieves the count value from the GPCs is responsible for applying multiplier 302 to calculate the total FLOPs. For example, a 512-bit packed double-precision instruction (subevent 6) performs 8 FLOPs. When a GPC programmed to count FP_ARITH_INST_RETIRED subevent 6 receives a retirement indication for this 512-bit packed double-precision arithmetic instruction from a retirement unit, the counter increments its count value by one. However, when the software component retrieves the count value, it may multiply the count value by a multiplier of 8.

Thus, the total FLOPs for an application can be obtained by counting the number of instructions retired for each register size and element size combination, then multiplying by the number of elements in that combination, then accumulating across the combinations. The subevent control mask 209 specifies which types of instructions will be counted. Multiple subevents can be selected simultaneously. For example, all scalar operations (single- or double-precision) can be counted by setting bit 0 to logical value one and bit 1 to logical value one in the subevent mask. A software consumer then multiplies the count by a known operation count (e.g., multiplier 302) for that subevent.
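
As an illustration of selecting several subevents at once, the following sketch builds a subevent mask and looks up the multiplier shared by the selected subevents. The bit assignment is an assumption made for this example: the text above only states that bits 0 and 1 select the scalar subevents and that subevent 6 is the 512-bit packed double-precision type; which scalar bit is single and which is double, and the remaining positions, are hypothetical.

```python
from enum import IntFlag


class SubEvent(IntFlag):
    # Hypothetical bit assignment (see the assumptions stated above).
    SCALAR_DOUBLE = 1 << 0
    SCALAR_SINGLE = 1 << 1
    PACKED_128_DOUBLE = 1 << 2
    PACKED_128_SINGLE = 1 << 3
    PACKED_256_DOUBLE = 1 << 4
    PACKED_256_SINGLE = 1 << 5
    PACKED_512_DOUBLE = 1 << 6


# Elements per instruction for each subevent (the multiplier software applies).
MULTIPLIER = {
    SubEvent.SCALAR_DOUBLE: 1,
    SubEvent.SCALAR_SINGLE: 1,
    SubEvent.PACKED_128_DOUBLE: 2,
    SubEvent.PACKED_128_SINGLE: 4,
    SubEvent.PACKED_256_DOUBLE: 4,
    SubEvent.PACKED_256_SINGLE: 8,
    SubEvent.PACKED_512_DOUBLE: 8,
}


def multiplier_for_mask(mask):
    """Return the single multiplier shared by every subevent selected in mask."""
    factors = {mult for sub, mult in MULTIPLIER.items() if int(sub) & int(mask)}
    if len(factors) != 1:
        raise ValueError("mask must select subevents that share one multiplier")
    return factors.pop()


all_scalar = SubEvent.SCALAR_DOUBLE | SubEvent.SCALAR_SINGLE
print(int(all_scalar), multiplier_for_mask(all_scalar))  # 3 1
```

Rejecting a mask that mixes multipliers is a design choice of this sketch: a single count can only be scaled by one factor, so mixed selections would not yield an exact FLOP total.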

Note that 256-bit double-precision and 128-bit single-precision have the same FLOP count of 4 since both have 4 elements, but have separate subevents to support single vs. double precision counting. The total FLOPs would then be the sum of the counter results, each multiplied by the corresponding multiplier:

FLOPs = 1*(scalar_single and scalar_double) + 2*(128b_packed_double) + 4*(256b_packed_double and 128b_packed_single) + 8*(256b_packed_single)
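
A worked instance of this sum, using made-up counter values purely for illustration (they are not measurements), might look as follows. The function name is hypothetical.

```python
def total_flops(count_by_multiplier):
    """Sum multiplier * retired-instruction count over all counters."""
    return sum(mult * count for mult, count in count_by_multiplier.items())


# Hypothetical counter readings, keyed by the multiplier of each counter:
# 1 -> scalar single and double, 2 -> 128-bit packed double,
# 4 -> 256-bit packed double and 128-bit packed single,
# 8 -> 256-bit packed single.
readings = {1: 1000, 2: 200, 4: 50, 8: 10}
print(total_flops(readings))  # 1680
```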

The total FLOPs count can be obtained in a single run of the application by simultaneously utilizing, for example, four performance monitoring counters, each programmed to the HPC FLOPs configuration, but with different subevents. This subevent configuration also allows for bundling commonly used types, scalar vs. packed and single vs. double, using fewer GPCs as shown in FIG. 4. Note that although only seven subevents have been described, more or fewer subevents may also be applied. Also note that although, throughout this application, embodiments of the invention are used to count a specific set of arithmetic operations, the invention is not so limited, and other types of operations may also be counted, such as shifts or ANDs.

FIG. 5A is a flow diagram illustrating a method for counting FLOPs according to one embodiment of the invention. Method 500 may be performed by processor 104. Referring to FIG. 5A, at block 501, a command is received via a counter access interface (e.g., MSRs) to program one or more counters of a processor, where the command specifies the types of instructions (e.g., main event and subevents) to be counted by the counters. At block 502, the counters are configured based on the command, including configuring a first counter (e.g., GPC) to count instructions of a first type having a first combination of a register size and an element size (register/element size) and configuring a second counter to count instructions of a second type having a second combination of register/element size that is different than the first combination. Subsequently, at block 503, in response to instructions retired from an execution unit, the programmed counters are to count the retired instructions based on different combinations of register/element sizes, including the first and second combinations. At block 504, the count values are made accessible to software via a counter access interface (e.g., MSRs). FIG. 5B is a flow diagram illustrating a counting embodiment based on the subevents as shown in FIG. 3.

FIG. 6 is a flow diagram illustrating a method for determining the number of arithmetic operations performed by certain instructions according to one embodiment of the invention. Method 600 may be performed by a software application such as applications 101-102 of FIG. 1. Referring to FIG. 6, at block 601, processing logic configures, via a counter access interface such as MSR registers, a counter of a processor or processor core to count the number of instructions executed by the processor, where the instructions correspond to one or more combinations of register sizes and element sizes. For example, processing logic may specify a main event and a subevent to specifically select and program a particular counter of the processor to count instances of one or more types of instructions with one or more combinations of register sizes and element sizes, as shown in FIG. 3. The processing logic may configure a counter to count instructions with different combinations of register sizes and element sizes. According to one embodiment, instructions with different combinations of register sizes and element sizes would be counted in the same counter if they are associated with the same factor or multiplier (e.g., performing the same amount of arithmetic operations in a cycle), as shown in FIG. 4. Subsequently, at block 602, processing logic retrieves a counter value of the programmed counter from the processor via the counter access interface and, at block 603, the processing logic applies a predetermined factor to the counter value to derive a number of arithmetic operations performed by the instructions.
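
The software-side flow of blocks 601-603 can be sketched as follows. Everything here other than the event name FP_ARITH_INST_RETIRED is an assumption made for illustration: the in-memory `FakeCounterInterface` stands in for the MSR-based counter access interface, the subevent bit numbering is hypothetical, and the counter values are made up. The sketch also shows grouping subevents that share a multiplier into one counter.

```python
from collections import defaultdict

# Assumed subevent bit -> multiplier table covering the six combinations
# that appear in the FLOPs sum given earlier (bit numbering is hypothetical).
SUBEVENT_MULTIPLIER = {0: 1, 1: 1, 2: 2, 3: 4, 4: 4, 5: 8}


def plan_counters(table):
    """Group subevent bits by multiplier: returns {multiplier: subevent mask}."""
    masks = defaultdict(int)
    for bit, multiplier in table.items():
        masks[multiplier] |= 1 << bit
    return dict(sorted(masks.items()))


class FakeCounterInterface:
    """In-memory stand-in for the counter access interface (not real MSR access)."""

    def __init__(self):
        self.config = {}
        self.values = {}

    def configure(self, index, main_event, subevent_mask):
        self.config[index] = (main_event, subevent_mask)
        self.values[index] = 0

    def read(self, index):
        return self.values[index]


def measure(iface, table, run_workload):
    plan = plan_counters(table)
    for index, mask in enumerate(plan.values()):       # block 601: configure
        iface.configure(index, "FP_ARITH_INST_RETIRED", mask)
    run_workload(iface)
    return {mult: mult * iface.read(index)              # blocks 602-603: read, scale
            for index, mult in enumerate(plan)}


def fake_workload(iface):
    # Made-up counts standing in for a real run of the application.
    iface.values.update({0: 1000, 1: 200, 2: 50, 3: 10})


print(measure(FakeCounterInterface(), SUBEVENT_MULTIPLIER, fake_workload))
# {1: 1000, 2: 400, 4: 200, 8: 80}
```

With this table the plan uses four counters, one per distinct multiplier, which matches the four-counter arrangement mentioned earlier.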

FIG. 7 is a block diagram illustrating an example of a data processing system according to one embodiment of the invention. System 900 may represent any of the systems described above. For example, system 900 may represent a desktop, a laptop, a tablet, a server, a mobile phone (e.g., Smartphone), a media player, a personal digital assistant (PDA), a personal communicator, a gaming device, a network router or hub, a wireless access point or repeater, a set-top box, or a combination thereof. Note that while FIG. 7 illustrates various components of a data processing system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to embodiments of the present invention. It will also be appreciated that network computers, handheld computers, mobile phones, and other data processing systems which have fewer components or perhaps more components may also be used with embodiments of the present invention.

Referring to FIG. 7, in one embodiment, system 900 includes processor 901 and chipset 902 to couple various components to processor 901 including memory 905 and devices 903-904 via a bus or an interconnect. Processor 901 may represent a single processor or multiple processors with a single processor core or multiple processor cores 909 included therein. Processor 901 may represent one or more general-purpose processors such as a microprocessor, a central processing unit (CPU), or the like. More particularly, processor 901 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processor 901 may also be one or more special-purpose processors such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, a graphics processor, a communications processor, a cryptographic processor, a co-processor, an embedded processor, or any other type of logic capable of processing instructions. For example, processor 901 may be a Pentium® 4, Pentium® Dual-Core, Core™ 2 Duo and Quad, Xeon™, Itanium™, XScale™, Core™ i7, Core™ i5, Celeron®, or StrongARM™ microprocessor available from Intel Corporation of Santa Clara, Calif. Processor 901 is configured to execute instructions for performing the operations and steps discussed herein.

Processor 901 may include an instruction decoder, which may receive and decode a variety of instructions. The decoder may generate and output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which reflect, or are derived from, an original input instruction. The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, microcode read only memories (ROMs), look-up tables, hardware implementations, programmable logic arrays (PLAs), and the like.

The decoder may not be a required component of processor 901. In one or more other embodiments, processor 901 may instead have an instruction emulator, an instruction translator, an instruction morpher, an instruction interpreter, or other instruction conversion logic. Various different types of instruction emulators, instruction morphers, instruction translators, and the like, are known in the arts. The instruction conversion logic may receive the bit range isolation instruction, emulate, translate, morph, interpret, or otherwise convert the bit range isolation instruction, and output one or more instructions or control signals corresponding to the original bit range isolation instruction. The instruction conversion logic may be implemented in software, hardware, firmware, or a combination thereof. In some cases, some or all of the instruction conversion logic may be located off-die with the rest of the instruction processing apparatus, such as a separate die or in a system memory. In some cases, the instruction processing apparatus may have both the decoder and the instruction conversion logic.

Processor 901 and/or cores 909 may further include one or more execution units coupled with, or otherwise in communication with, an output of the decoder. The term “coupled” may mean that two or more elements are in direct electrical contact or connection. However, “coupled” may also mean that two or more elements are not in direct connection with each other, but yet still co-operate or interact or communicate with each other (e.g., through an intervening component). As one example, the decoder and the execution unit may be coupled with one another through an intervening optional buffer or other component(s) known in the arts to possibly be coupled between a decoder and an execution unit. Processor 901 and/or cores 909 may further include multiple different types of execution units, such as, for example, arithmetic units, arithmetic logic units (ALUs), integer units, etc.

Processor 901 may further include one or more register files including, but not limited to, integer registers, floating point registers, vector or extended registers, status registers, and an instruction pointer register, etc. The term “registers” is used herein to refer to the on-board processor storage locations that are used as part of macro-instructions to identify operands. In other words, the registers referred to herein are those that are visible from the outside of the processor (from a programmer's perspective). However, the registers should not be limited in meaning to a particular type of circuit. Rather, a register need only be capable of storing and providing data, and performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store 32-bit or 64-bit integer data. A register file may contain extended multimedia SIMD registers (e.g., XMM) for packed data. Such registers may include 128-bit wide registers (e.g., XMM registers), 256-bit wide registers (e.g., YMM registers which may incorporate the XMM registers in their low order bits), and 512-bit wide registers, relating to SSE2, SSE3, SSE4, GSSE, and beyond (referred to generically as “SSEx”) technology to hold such packed data operands. Wider instructions and/or registers such as 1024-bit or greater can also be applied.

Processor 901 and/or cores 909 may also optionally include one or more other well-known components. For example, processor 901 may optionally include instruction fetch logic, pre-decode logic, scheduling logic, re-order buffers, branch prediction logic, retirement logic, register renaming logic, and the like, or some combination thereof. These components may be implemented conventionally, or with minor adaptations that would be apparent to those skilled in the art based on the present disclosure. Further description of these components is not needed in order to understand the embodiments herein, although further description is readily available, if desired, in the public literature. There are numerous different combinations and configurations of such components known in the arts. The scope is not limited to any known such combination or configuration. Embodiments may be implemented either with or without such additional components.

Chipset 902 may include memory control hub (MCH) 910 and input output control hub (ICH) 911. MCH 910 may include a memory controller (not shown) that communicates with a memory 905. MCH 910 may also include a graphics interface that communicates with graphics device 912. In one embodiment of the invention, the graphics interface may communicate with graphics device 912 via an accelerated graphics port (AGP), a peripheral component interconnect (PCI) express bus, or other types of interconnects. ICH 911 may provide an interface to I/O devices such as devices 903-904. Any of devices 903-904 may be a storage device (e.g., a hard drive, flash memory device), universal serial bus (USB) port(s), a keyboard, a mouse, parallel port(s), serial port(s), a printer, a network interface (wired or wireless), a wireless transceiver (e.g., WiFi, Bluetooth, or cellular transceiver), a media device (e.g., audio/video codec or controller), a bus bridge (e.g., a PCI-PCI bridge), or a combination thereof.

MCH 910 is sometimes referred to as a Northbridge and ICH 911 is sometimes referred to as a Southbridge, although some people make a technical distinction between them. As used herein, the terms MCH, ICH, Northbridge and Southbridge are intended to be interpreted broadly to cover various chips whose functions include passing interrupt signals toward a processor. In some embodiments, MCH 910 may be integrated with processor 901. In such a configuration, chipset 902 operates as an interface chip performing some functions of MCH 910 and ICH 911, as shown in FIG. 8. Furthermore, graphics accelerator 912 may be integrated within MCH 910 or processor 901.

Memory 905 may store data including sequences of instructions that are executed by processor 901, or any other device. For example, executable code 913 and/or data 914 of a variety of operating systems, device drivers, firmware (e.g., basic input/output system or BIOS), and/or applications can be loaded in memory 905 and executed by processor 901. An operating system can be any kind of operating system, such as, for example, Windows® operating system from Microsoft®, Mac OS/iOS from Apple, Android® from Google®, Linux®, Unix®, or other real-time operating systems. In one embodiment, memory 905 may include one or more volatile storage (or memory) devices such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or other types of storage devices. Nonvolatile memory may also be utilized, such as a hard disk or a flash storage device. Front side bus (FSB) 906 may be a multi-drop or point-to-point interconnect. The term FSB is intended to cover various types of interconnects to processor 901. Chipset 902 may communicate with other devices such as devices 903-904 via point-to-point interfaces. Bus 906 may be implemented as a variety of buses or interconnects, such as, for example, a quick path interconnect (QPI), a hyper transport interconnect, or a bus compatible with advanced microcontroller bus architecture (AMBA) such as an AMBA high-performance bus (AHB).

Cache 908 may be any kind of processor cache, such as level-1 (L1) cache, L2 cache, L3 cache, L4 cache, last-level cache (LLC), or a combination thereof. Cache 908 may be embedded within processor 901 and/or external to processor 901. Cache 908 may be shared amongst cores 909. Alternatively, at least one of cores 909 further includes its own local cache embedded therein. At least one of cores 909 may utilize both the local cache and the cache shared with another one of cores 909. Processor 901 may further include direct cache access (DCA) logic to enable other devices such as devices 903-904 to directly access cache 908. Processor 901 and/or chipset 902 may further include an interrupt controller, such as an advanced programmable interrupt controller (APIC), to handle interrupts such as message signaled interrupts.

Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The techniques shown in the figures can be implemented using code and data stored and executed on one or more electronic devices. Such electronic devices store and communicate (internally and/or with other electronic devices over a network) code and data using computer-readable media, such as non-transitory computer-readable storage media (e.g., magnetic disks; optical disks; random access memory; read only memory; flash memory devices; phase-change memory) and transitory computer-readable transmission media (e.g., electrical, optical, acoustical or other form of propagated signals, such as carrier waves, infrared signals, digital signals).

The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., embodied on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.

In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:
 1. A processor, comprising: one or more execution units to execute instructions, each having one or more elements in different element sizes using one or more registers in different register sizes; a counter configured to count a number of instructions performing predetermined types of operations that have been executed by the one or more execution units; and one or more registers to allow an external component to configure the counter to count a number of instructions associated with a combination of a register size and an element size (register/element size) and to retrieve a counter value produced by the counter.
 2. The processor of claim 1, further comprising a retirement unit to retire instructions executed by the one or more execution units, the retirement unit configured to instruct the counter to count a number of the instructions based on a combination of register/element size associated with each instruction retired.
 3. The processor of claim 2, wherein the retirement unit is configured to, for each instruction retired from the one or more execution units, determine a register size and an element size of the retired instruction, select a counter that has been configured to count a number of instructions associated with the determined register size and element size, and transmit a signal to the selected counter to cause the selected counter to increment its count value.
 4. The processor of claim 3, wherein the retirement unit is further to select and instruct the counters based on a number of elements operated on by the instructions.
 5. The processor of claim 3, wherein the retirement unit is further to determine whether the instruction is a compound instruction that performs multiple predetermined operations per element, and transmit a signal to an associated counter to increment with an incremental value equivalent to a number of operations performed by the compound instruction per element.
 6. The processor of claim 1, wherein the counter is further configured to count a number of first instructions having a first combination of a register size and element size (register/element size) and a number of second instructions having a second combination of register/element size that is different than the first combination.
 7. The processor of claim 1, wherein a register size is one of 32-bit, 64-bit, 128-bit, 256-bit, 512-bit, and 1024-bit instruction width, and wherein an element size represents one of a single precision and double precision.
 8. The processor of claim 1, wherein the external component is to compute a number of arithmetic operations performed by the instructions based on the counter value, including applying to the counter value a predetermined factor that is associated with the combination of register/element size.
 9. A computer-implemented method, comprising: configuring a counter within a processor having one or more execution units to count a number of instructions performing predetermined types of operations, the one or more execution units to execute instructions, each having one or more elements in different element sizes using one or more registers in different register sizes; counting, using the counter, a number of instructions having a combination of a register size and element size (register/element size) executed by the one or more execution units; and providing access to the counter to allow an external component to retrieve a counter value produced by the counter.
 10. The method of claim 9, further comprising selecting and instructing the counter to count the instructions based on a combination of register/element size associated with each instruction retired from the one or more execution units.
 11. The method of claim 10, further comprising: for each instruction retired from the one or more execution units, determining a register size and an element size of the retired instruction, selecting a counter that has been configured to count a number of instructions associated with the determined register size and element size, and transmitting a signal to the selected counter to cause the selected counter to increment its count value.
 12. The method of claim 11, wherein the retirement unit is further to select and instruct the counters based on a number of elements operated on by the instructions.
 13. The method of claim 11, further comprising determining whether the instruction is a compound instruction that performs multiple operations per element, and transmitting a signal to an associated counter to increment with an incremental value equivalent to a number of operations per element performed by the compound instruction.
 14. The method of claim 9, wherein the counter is further configured to count a number of first instructions having a first combination of a register size and element size (register/element size) and a number of second instructions having a second combination of register/element size that is different than the first combination.
 15. The method of claim 9, wherein a register size is one of 32-bit, 64-bit, 128-bit, 256-bit, 512-bit, and 1024-bit instruction width, and wherein an element size represents one of a single precision and double precision.
 16. The method of claim 9, wherein the external component is to compute a number of arithmetic operations performed by the instructions based on the counter value, including applying to the counter value a predetermined factor that is associated with the combination of register/element size.
 17. A data processing system, comprising: a dynamic random-access memory (DRAM); and a processor coupled to the DRAM, the processor including one or more execution units to execute instructions, each having one or more elements in different element sizes using one or more registers in different register sizes, a counter configured to count a number of instructions performing predetermined types of operations executed by the one or more execution units, and one or more registers to allow an external component to configure the counter to count a number of instructions associated with a combination of a register size and an element size (register/element size) and to retrieve a counter value produced by the counter.
 18. The system of claim 17, wherein the processor further comprises a retirement unit to retire instructions executed by the one or more execution units, the retirement unit configured to select and instruct the counter to count the instructions based on a combination of register/element size associated with each instruction retired.
 19. The system of claim 18, wherein the retirement unit is configured to, for each instruction retired from the one or more execution units, determine a register size and an element size of the retired instruction, select a counter that has been programmed to count a number of instructions associated with the determined register size and element size, and transmit a signal to the selected counter to cause the selected counter to increment its count value.
 20. The system of claim 19, wherein the retirement unit is further to select and instruct the counters based on a number of elements operated on by the instructions.